Topic-specific Web Searching based on a Real-text Dictionary

نویسنده

  • Ari Pirkola
چکیده

The contributions of this paper are twofold. First, we present a new type of dictionary that is intended as a search assistance in topic-specific Web searching. The method to construct the dictionary is a general method that can be applied to any reasonable topic. The first implementation deals with climate change. The dictionary has the following new features compared to standard dictionaries and thesauri: (A) It contains real-text phrases (e.g. rising sea levels) in addition to the standard dictionary forms (sea-level rise). The phrases were extracted automatically from the pages dealing with climate change, and are thus known to appear in the pages discussing climate change issues when used as search terms. (B) Synonyms, i.e., different spelling, syntactic, and short form variants of the phrase are grouped together into the same entry (synonym set) using approximate string matching. (C) Each phrase is assigned an importance score (IS) which is calculated based on the frequencies of the phrase in relevant pages (i.e., pages on climate change) and nonrelevant pages. Second, we investigate how effective the IS is for indicating the best phrase among synonymous phrases and for indicating effective phrases in general from the viewpoint of search results. The experimental results showed that the best phrases have higher ISs than the other phrases of a synonym set, and that the higher the IS is the better the search results are. This paper also describes the crawler used to fetch the source data for the climate change dictionary and discusses the benefits of using the dictionary in Web

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Real-time tagging of biomedical entities

Automatic annotation of text is an important complement to manual annotation, because the latter is highly labor intensive. We have developed a fast dictionary-based named entity recognition system, which is used for both real-time and bulk processing of text in a variety of biomedical web resources. We propose to adapt the system to make it interoperable with the PubAnnotation and Open Annotat...

متن کامل

Kaleta SEMANTIC TEXT INDEXING

The following article presents a specific issue of semantic analysis of texts in natural language – text indexing and describes one field of its application (web browsing). The main part of this article describes a computer system assigning a set of semantic indexes (similar to keywords) to a particular text. The indexing algorithm employs a semantic dictionary to find specific words in a text ...

متن کامل

Unsupervised Learning of Visual Sense Models for Polysemous Words

Polysemy is a problem for methods that exploit image search engines to build object category models. Existing unsupervised approaches do not take word sense into consideration. We propose a new method that uses a dictionary to learn models of visual word sense from a large collection of unlabeled web data. The use of LDA to discover a latent sense space makes the model robust despite the very l...

متن کامل

Automatic Discovery of Similar Words

We deal with the issue of automatic discovery of similar words (synonyms and near-synonyms) from different kind of sources: from large corpora of documents, from the Web, and from monolingual dictionaries. We present in detail three algorithms that extract similar words from a large corpus of documents and consider the specific case of the World Wide Web. We then describe a recent method of aut...

متن کامل

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012